This historical overview also briefly summarizes the essential problems and tasks of
databases. Ideally, each sequence is inspected by hand, analyzed with various bioinformatics
programs, and then accurately annotated. This is a lot of work, typically referred to as
database maintenance. Since data sets in bioinformatics usually grow very quickly, database
maintenance is a chronic problem, often exacerbated by the fact that new databases
are usually created as part of a new project and then no longer maintained once the PhD
thesis or postdoctoral project ends. Only a few large institutions, which are mentioned here
and elsewhere in the book, have enough staff to keep their data well maintained nevertheless,
in particular the NCBI, the EBI and the SIB (Swiss Institute of Bioinformatics).
Other problems of databases are cross-linking to other data (which is also difficult due to
the constant growth of data), maintenance of content (especially when new types of content
are added), and the number of erroneous or outdated entries.
For the protein databases UniProt and PDB (one of the oldest bioinformatics databases,
established in 1971), as for many other databases, the uniform formatting
of entries is a problem. And of course it is difficult, not only for BLAST, to find entries
quickly and accurately in constantly growing databases. There are the two problems of
recall (sensitivity; what fraction of the truly matching entries stored in the database does
the search actually find?) and precision (loosely, specificity; do I find exactly what I am
looking for, or does my program suspect that half the database could be a match?).
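The two measures can be made concrete with a small calculation. The following is a minimal sketch in Python, assuming we already know which database entries are truly relevant to a query; the function name precision_recall and the entry identifiers are hypothetical and serve only as illustration, they are not part of BLAST or of any database interface.

```python
def precision_recall(hits, relevant):
    """Return (precision, recall) for one database search.

    hits     -- identifiers the search program reported (hypothetical IDs)
    relevant -- identifiers of all truly matching entries in the database
    """
    hits, relevant = set(hits), set(relevant)
    true_positives = hits & relevant  # reported AND truly relevant

    # Precision: which fraction of the reported hits is correct?
    precision = len(true_positives) / len(hits) if hits else 0.0
    # Recall: which fraction of the relevant entries was found?
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data: four relevant entries exist, the search reports three
# hits, one of which ("X9") is a false positive.
relevant = {"P1", "P2", "P3", "P4"}
hits = ["P1", "P2", "X9"]

precision, recall = precision_recall(hits, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this example two of the three reported hits are correct (precision about 0.67), but only two of the four relevant entries were found (recall 0.5). A search that returned half the database would usually reach high recall at the cost of very low precision.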
6 Extremely Fast Sequence Comparisons Identify All the Molecules That Are Present…